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In many scientific and engineering applications, one has to solve not one but a sequence of instances 
of the same problem. Often times, the problems in the sequence are linked in a way that allows 
intermediate results to be reused. A characteristic example for this class of applications is given 
by the Genome-Wide Association Studies (GWAS), a widely spread tool in computational biology. 
GWAS entails the solution of up to trillions (10 12 ) of correlated generalized least-squares problems, 
posing a daunting challenge: the performance of petaflops (10 15 floating-point operations) over 
terabytes of data. 

In this paper, we design an algorithm for performing GWAS on multi-core architectures. This is 
accomplished in three steps. First, we show how to exploit the relation among successive problems, 
thus reducing the overall computational complexity. Then, through an analysis of the required 
data transfers, we identify how to eliminate any overhead due to input/output operations. Finally, 
we study how to decompose computation into tasks to be distributed among the available cores, 
to attain high performance and scalability. 

We believe the paper contributes valuable guidelines of general applicability for computational 
scientists on how to develop and optimize numerical algorithms. 

Categories and Subject Descriptors: G.4 [Mathematical Software]: — Algorithm design and 
analysis, Efficiency 

General Terms: Algorithms, Performance 

Additional Key Words and Phrases: Numerical linear algebra, sequences of problems, shared- 
memory, out-of-core, genome-wide association studies 



1. INTRODUCTION 

Many traditional linear algebra libraries, such as LAPACK [Anderson et al. 1999] 
and ScaLAPACK [Blackford et al. 1997], and tools such as Matlab, focus on provid- 
ing efficient building blocks for solving one instance of many standard problems. By 
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contrast, engineering and scientific applications often originate multiple instances 
of the same problem, thus leading to a — possibly very large — sequence of indepen- 
dent library invocations. The drawback of this situation is the missed opportunity 
for optimizations and data reuse; in fact, due to their black-box nature, libraries 
cannot avoid redundant computations or exploit problem-specific properties across 
invocations. The underlying theme of this paper is that a solution scheme devised 
for a specific sequence may be much more efficient than the repeated execution of 
the best single-instance routine [Bientinesi et al. 2010; Di Napoli et al. 2012]. 

Genome-wide Association Studies (GWAS) clearly exemplify this issue, posing 
a formidable computational challenge: the solution of billions, or even trillions, of 
correlated generalized least-squares (GLS) problems. As described below, a solution 
based on the traditional black box approach is entirely unfeasible. With this paper 
we demonstrate that by tackling all the problems as a whole, and by exploiting 
application specific knowledge in combination with a careful parallelization scheme, 
it becomes possible for biologists to complete GWAS in matter of hours. 

Any algorithm to perform GWAS needs to address the following issues. 

— Complexity. In a representative study, one has to solve a grid ofmx( gen- 
eralized least-squares problems, where m and t are of the order of millions and 
hundreds of thousands, respectively. The complexity for the solution of one prob- 
lem (of size n) in isolation is 0(n 3 ) floating point operations (flops), adding up to 
0{mtn 3 ) flops for the entire study. In the case of the largest problem addressed in 
our experiments (Section 6; m — 10 6 , t = 10 5 , and n = 10 3 ), this approach would 
require the execution of roughly 10 23 flops; even if performed at the theoretical 
peak of Sequoia, the fastest supercomputer in the world, 1 the computation would 
take more than 2 months. Here we illustrate how, by exploiting the structure 
that links different problems, the complexity reduces to 0(mtn), and the same 
experiment on a 40-core node completes in about 12 hours. 

— Size of the datasets. GWAS entails the processing of terabytes of data. 
Specifically, the aforementioned analysis involves reading 10 GBs and writing 
3.2 TBs. Largely exceeding even the combined main memory of current typi- 
cal clusters, these requirements demand an out-of-core mechanism to efficiently 
handle datasets residing on disk. In such a scenario, data is partitioned in slabs, 
and it becomes critical to determine the most appropriate traversal, as well as 
the size and shape of each slab. The problem is far from trivial, because these 
factors affect the amount of transfers and computation performed, as well as the 
necessary extra storage. By modeling all these factors, we find the best traversal 
direction, and show how to determine the shape and size of the slabs to achieve 
a complete overlap of transfers with computations. As a result, irrespective of 
the data size, the efficiency of our in-core solver is sustained. 
Parallelism. The partitioning of data in slabs that fit in main memory translates 
to the computation of the two-dimensional grid of GLSs in terms of sub-grids or 
tiles of such problems. The computation of each tile must be organized to exploit 
multi-threaded parallelism and to attain scalability even with a large number of 
cores. To this end, we present a study on how to decompose the tiles into smaller 



1 http://www.top500.org/, as of October 2012. 
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computational tasks — to take advantage of the cache memories — and how to 
distribute such tasks among the computing cores. 

Our study of these three issues is not specifically tied to GWAS; we keep the 
discussion in general terms, to concentrate on the methodology rather than on the 
problem at hand. We believe this paper contributes valuable guidelines for compu- 
tational scientists on how to optimize numerical algorithms. For the specific case 
of GWAS, thanks to the combination of a lower complexity, a perfect overlapping 
of data movement with computation, and a nearly perfect scalability, we outper- 
form the current state-of-the-art tools, GcnABEL and FaST-LMM [Aulchcnko et al. 
2007; Lippert et al. 2011], by a factor of more than 1000. 



Section 2 introduces GWAS both in biology and linear algebra terms. In Sec- 
tion 3, we detail how our algorithm for GWAS exploits application-specific knowl- 
edge to reduce the asymptotical cost of the best existing algorithms, while in Sec- 
tion 4 we analyze the required data transfers, and discuss the application of out-of- 
core techniques to completely hide the overhead due to input/output operations. 
In Section 5, we tailor our algorithm to exploit shared-memory parallelism, and we 
provide performance results in Section 6. Future work is discussed in Section 7, 
and conclusions are drawn in Section 8. 

2. MULTI-TRAIT GENOME-WIDE ASSOCIATION STUDIES 

In a living being, observable characteristics — traits or phenotypes — such as eye 
color, height, and susceptibility to disease, are influenced by information encoded in 
the genome. The identification of specific regions of the genome — single-nucleotide 
polymorphisms or SNPs — associated to a given trait enhances the understanding of 
the trait, and, in the case of diseases, facilitates prevention and treatment. Genome- 
wide Association Studies (GWAS) are a powerful statistical tool for locating the 
SNPs involved in the control of a trait [Lauc et al. 2010; Levy et al. 2009; Speliotes 
et al. 2010]. The simultaneous analysis of many phenotypes is the objective of the 
so called multi-trait GWAS. 

Every year, computational biologists publish hundreds of GWAS-related pa- 
pers [Hindorff et al. ], with a clear trend towards analyses that include more and 
more SNPs. Ideally, the computational biologists aim at testing the whole human 
genome against as many traits as possible. Mathematically, for each SNP Xi and 
trait yj ( i G [1 . . .m],j e [1 . . . i]), one has to solve the generalized least-squares 
problem (GLS) 



— yj is the vector of observations; 
— Xi is the design matrix; 

— Mj is the covariance matrix, representing dependencies among observations; and 
— the vector 6^ expresses the relation between a variation in the SNP {Xj) and a 
variation in the trait {yj). 

In multi-trait GWAS, the covariance matrix satisfies 



b l3 := ( Xj MJ 1 X l )~ 1 Xj MJ 1 y j 



(1) 



where 



M r .= a]-{h^ + (l-hj)I), (2) 
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Fig. 1: Interpretation of GWAS as a two-dimensional sequence of generalized least-squares prob- 
lems (b := (X T M~ 1 X )~ 1 X T M~ 1 y). GWAS with multiple phenotypes requires the solution of 
m X t correlated GLS problems, originating a three-dimensional object B of size m X t X p. As 
suggested by the colors, along the t direction the covariance matrix M and the phenotype y vary, 
while the design matrix X does not; conversely, in the m direction, M and y are fixed, while X 
varies. The colors also hint at the fact that X can be viewed as consisting of two parts, Xl and 
Xji, where the former is constant across the entire grid and the latter changes along m. 

where 

— the kinship matrix <!> contains the relationship among all studied individuals, 
— a 2 and h 2 are trait-dependent scalar estimates, and 
— / is the identity matrix. 

The number of SNPs, m, ranges between 10 6 and 10 8 (180.000.000 for the full 
human genome), and the number of traits, t, is either 1 (single-trait analysis), or 
ranges between 10 4 and 10 5 . 

In Fig. 1, we provide a visual interpretation of the problem: Each point in the 
grid, i.e., the vertical vector of size p at position (i,j), corresponds to the solution 
of one GLS (association between the i-th SNP and the j-th trait). A column in 
the figure represents a single-trait analysis (association between all the SNPs and a 
given trait); in this case, x/j as well as h 2 and cr?, and therefore Mj, are fixed. The 
full grid depicts the whole multi-trait analysis; as expected, y, h 2 , a 2 , and M vary 
along the t dimension, while the SNP Xi is fixed; the kinship matrix $ is constant 
throughout the two-dimensional grid. 

In linear algebra terms, Eq. (1) solves a linear regression with non-independent 
outcomes, where by £ 1Z P , Xi £ JZ nxp is full rank, My <= fz nxn is symmetric 
positive definite (SPD), yj £ 1Z n , $ £ U nxn is symmetric, I £ TZ nxn , and a 2 and 
h 2 £ 1Z. Moreover, the design matrix Xi presents a special structure: each Xi may 
be partitioned as (X L \ X Ri ), where X L £ i?" x (p- 1 ) is the same for all AYs, while 
X Ri £ R nxl varies. The sizes are 10 3 < n < 10 4 and 2 < p < 20. The quantities 
Xi, $, h 2 , a 2 , and yj are known, and the vector 6^ is to be computed. 

2.1 Related Work 

The black-box nature of traditional numerical libraries limits them to offering rou- 
tines for the zero-dimensional case, a point in the grid, and use such kernel repeat- 
edly for each problem in the sequence. As discussed previously, such an approach 
is absolutely unviable. Instead, GWAS-specific tools, such as GenABEL and FaST- 
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LMM, incorporate and exploit knowledge specific to the application. However, 
these tools still present limitations in all three of the GWAS challenges: f ) they are 
tailored for a one-dimensional sequence of problems, individual columns of Fig. 1; 
for the solution of the entire two-dimensional grid they only offer the possibility of 
repeatedly using their one-dimensional solver. From a multi-trait GWAS perspec- 
tive, this still represents a black box approach, missing opportunities for further 
computation reuse; 2) they incorporate rudimentary out-of-core mechanisms that 
lack the overlapping of data transfers with computation, thus incurring a consider- 
able overhead; and 3) they attain only poor scalability. 

2.2 Terminology 

Here we give a brief description of the acronyms used throughout the paper. 

— GWAS: Genome- Wide Association Studies. 
— GLS: Generalized Least-Squares problems. 

— GenABEL: A framework for statistical genomics, including one of the most widely 

used packages to perform GWAS. 
— gwfgls: GenABEL's state-of-the-art routine for GWAS. 
— FAST-LMM: The most recent high-performance tool for GWAS. 
— mt-GWAS: our novel solver for multi-trait GWAS. 

Additionally, we will reference many times 

— BLAS (Basic Linear Algebra Subprograms) [Dongarra et al. 1990], and 
— LAPACK (Linear Algebra PACKage), 

as the de-facto standard libraries for high-performance dense linear algebra com- 
putations. 

3. THE MULTI-TRAIT GWAS ALGORITHM 

mt-GWAS, a novel algorithm for multi-trait GWAS is introduced: First we describe 
a simplified version that solves one single GLS instance, and then show how the 
algorithm can be tailored for Eqs. (1) and (2), the solution of the whole GWAS. 
We stress that while mt-GWAS is suboptimal for the solution of one single GLS, it 
exploits the specific structure and properties of the full problem, and dramatically 
reduces its computational complexity, thus achieving remarkable speedups. 

3.1 Single-instance 

The fastest approach for Eq. (1) involves computing the Cholesky factorization of 
the matrix M [Paige 1979], for a cost of ^ flops. In the context of GWAS, in which 
M varies with the index j (Eq. (2)), such a factorization has to be performed t = 
10 4 -10 5 times, thus originating a computational bottleneck. Instead, the main idea 
behind mt-GWAS is to take advantage of the fact that M results from scaling and 
shifting the matrix which remains constant for all i's and j's. Then, computing 
the eigendecomposition ZAZ T = <& [Z orthogonal, A diagonal), 

Mj : = o](h)ZkZ T + (1 - h))I) 

= ZD 3 Z T , where Dj = <r?(h?A + (1 - h?)I). 
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While the eigendecomposition is much more expensive than the Cholesky factor- 

nf. 
3 



ization, ^-n 3 vs. ^ flops, it allows us to express Af ? 1 , for any j, with only 0(r 
additional flops: 

Mr 1 = ZDJ 1 Z T . 

After this initial step, Eq. (1) can be rewritten as 

bij := {XjZD- 1 Z T X i y 1 XjZDj 1 Z T Vr 

Since Mj is SPD, its eigenvalues Dj are all positive. Accordingly, the algorithm 

may take advantage of the symmetry of the expression XfZDjZ T X il and save 

computations. This is accomplished by computing the inverse of the square root of 

_i 

each entry of Dj (Kj := D- 2 ), resulting in 

:= {Xj ZK.Kj Z T X^Xf ZK 3 Kj Z T y 3 . 
Next, the products X[ := Z T Xi and y'j := Z T yj are computed 
hj := (X^K j K T Xi)- 1 X^K j K T y' j , 



and also Wjj := if J A| and Vj := Kjy 3 



hj ■ (WZWi^WZvi 



HJ ■— \ " ij " I] ) " ij U J- 

What remains is an ordinary least-squares problem. Numerical considerations 
allow us to safely rely on forming S^j :— WfjWij without incurring instabilities. 
The algorithm completes by computing hj :— WfjVj and solving the corresponding 
SPD linear system hj '■= ^ hj. mt-GWAS is detailed in Algorithm 1. 

Algorithm 1: mt-GWAS for the solution of a single 
instance of the generalized least-squares problem. 

ZAZ T = $ 
2 Dj :=a?(h 2 A + {l-h 2 )I) 

4 '.= Zt^ ' Xi 

5 y'j := Z T yj 

e Wij : KjX[ 
r Vj : Kjy'j 

. s l3 : ir,' y ir„ 

• K ■ W ij V 3 

io hj := S~j hj 



3.2 Tailoring for the two-dimensional sequence 

We now discuss how to extend Algorithm 1 for the solution of the two-dimensional 
sequence of GLSs specific to GWAS. The general objective is to identify opportu- 
nities for reusing partial calculations across different problems, in order to reduce 
the overall cost. 
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As a first step, we expose further application-specific knowledge: between any 
two matrices X^ and Xi 2 , only the right portion changes. In Algorithm 2, every 
appearance of Xi is replaced with its partitioned counterpart (Xl \ X^), and the 
partitioning is propagated. The subscripts L, R, T, and B stand for Left, .Right, 
Top, and Bottom, respectively. Since S is symmetric, the star in the top-right 
quadrant indicates the transpose of Sbl^ > an d this quadrant does not need to be 
either stored or computed. This intricate refinement of the algorithm is absolutely 
necessary to expose for each operand, which portion varies along which dimension. 



Algorithm 2: Single-instance version of mt-GWAS; 
the structure of Xi is exposed. 



D, :=a](h]A+{l-h])I) 



K jK j 



ZKZ T = $ 

3 

(X> L \X> Ri ):=ZT(X L \X Ri ) 

y' 3 ■■= z T Vj 

(W Lj \W Rij ):=Kj(X' L \X' Ri ) 
v 3 : Kfy>. 



StLj 









wlw Lj 









Next, we wrap Algorithm 2 with a double loop, corresponding to the traversal of 
the two-dimensional m-t grid, and reorganize the operations, aiming at eliminating 
redundant computation. Both traversals of the grid — by rows (for i, for j) and 
by columns (for j , for i) — are provided, in Algorithms 3 and 4, respectively. 
We will use the latter to describe how the computation can be rearranged; the same 
reasoning applies to the former. 

For each operation in the algorithm, the dependencies on the indices i and j are 
determined by the left-hand side operand(s). Any operation whose left-hand side 
does not include any subscript is invariant across the two-dimensional sequence; 
therefore it can be computed once and reused in every other iteration of the loops. 
This is the case for the eigendecomposition of $, and the computation of X' L . 
Operations that only vary across the t dimension (subscript j), are performed once 
per iteration over t — the outer loop — and reused across the iterations over m 
— the inner loop — . Finally, the operations labeled with i or i,j are placed in the 
innermost loop. 
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Algorithm 3: Solution of the 
two-dimensional grid of GLS problems 
depicted in Fig. 1. Traversal by rows. 

ZKZ T = $ 
X' L := Z T X L 
for 1 < i < m 

for 1 < j < t 

D :=a^K + {l-h))I) 

K jK J = Dji 



W Lj :=KJX' L 



Vj := Kfy> 
Stl, :=Wl.W Li 

Sbl, 3 ■■ u /. w L 

>>r : 11/ r , 

'>/>• ■ H /' 

bij := S i; j bij 



Algorithm 4: Solution of the 
two-dimensional grid of GLS problems 
depicted in Fig. 1. Traversal by columns. 

1 ZAZ T = $ 

2 Xj^ := Z^Xl 



IB 
16 



for 1 < j < t 



* £) j := ff |(/ lj 2 A+ (l-ft?)J) 



K jK J = Dp 

y'i ■= .</, 

W Lj := 2^ 

"j := Kjyi 
S TL] : U / U , 
b'r 

for 1 < i < m 



S BLij -W^ Wl. 
Sbr : Wf. II // 

^ : ii /!■; 

Ojj := fey 



As the reader might have noticed, the algorithm still performs redundant compu- 
tations: Lines 6-9, 11-12, and 15 in Algorithm 3 and line 12 in Algorithm 4 depend 
only on the dimension traversed by the inner loop, and are therefore recomputed 
at each iteration of the outer loop. This flaw can be resolved by precomputing 
the quantities outside of the double loop, and then accessing them from within 
the loops. While mathematically and algorithmically possible, the approach poses 
practical constraints on Algorithm 3, as it would require a fairly large amount of 
extra storage. The solution is instead applicable in Algorithm 4: The operation 
X' R , := Z T X R . may overwrite X Ri , making the size of temporary storage negli- 
gible. Let us stress the significance of this improvement: by avoiding redundant 
calculations, Algorithm 4 saves 2tmn 2 flops, thus drastically lowering the cost with 
respect to the other state-of-the-art algorithms, and eliminates the main computa- 
tion bottleneck. Henceforth, Algorithm 4 is our algorithm of choice. 

Practical considerations allow us for one more optimization. Notice that op- 
eration X' R . := Z T Xn i is performed independently for each X Ri: resulting in m 
matrix-vector multiplications. Instead, these matrix-vector operations may be bun- 
dled together in a single large matrix-matrix product, known to be a much more 
efficient operation. Similar reasoning applies to the operation at line 6, y'j :— Z T yj. 

mt-GWAS, the final algorithm that includes all these optimizations, is provided in 
Algorithm 5. There, Xr and y are used to represent the collection of all Xrs and 
all y% respectively: X R = (X Rl | X R2 \ ... \ X Rm ) and y = (y 1 | y 2 \ ■■■ \ y t ). 
The second and third columns indicate, respectively, for each individual operation, 



ACM Transactions on Mathematical Software, Vol. V, No. N, Month 20YY. 



Computing Petaflops over Terabytes of Data: The Case of GWAS 



9 



the corresponding BLAS or LAPACK routine and its associated cost. 





Algorithm 5: 


MT-GWAS 




i B : 


:= MT-GWAS( X L , X R , y, ti$ , 


* ) 




2 


ZAZ T = $ 


(eigendec) 


in q 

fn 3 


3 


X' L := Z T X L 


(gemm) 


2n 2 (p- 1) 


4 


A. R := Zj A.R 


(gemm) 


in to 


5 


"»)/ 7Tn 

y := Z J y 


(gemm) 


2rri 


6 


1UI 1^1 \ l 






7 


Dj :=(7?(/iU+(l-/i?)J) 


(scalar-op) 


2/i 




KiK T = L>~ 5 


('scalar-orA 




9 


W Lj :=KjX' L 


(scalar-op) 


(p — l)n 


10 


v s : A'/ //: 


(scalar-op) 


n 


11 


<? Ti . : U / II . 


(syrk) 


(p-lfn 


12 




(gemv) 


2(p- l)n 


13 


for 1 < i < m 






14 


W Rij : /V/.Y'„ 


(scalar-op) 


n 


15 


•V/;. : H /l. U ,. 


(gemv) 


2(p- l)n 


16 


S B R : U / lf„ 


(dot) 


2n 


17 


''/.' •= »/' 


(dot) 


2n 


18 




(posv) 


o( P 3 ) 



3.3 Computational cost 

Let us now compare the computational cost of mt-GWAS with that of the aforemen- 
tioned alternatives: LAPACK, FaST-LMM, and GenABEL. To this end, we recall 
the size of the input and output operands: 

-lieJi"'^ 1 ', —yeR nxt , $ei? nx ™, 

— X R e R nxm , —h% a 2 eR, -Be R mxtx P _ 

Typical values for these dimensions are: 
— 10 3 < n < 10 4 , — 10 6 < to < 10 8 , 

— 2 < p < 20, — 10 4 < t < 10 5 . 

The asymptotical cost of mt-GWAS (see third column in Algorithm 5) is 0(?i 3 + 
mn 2 + tn 2 + tmpn). Since in a typical scenario for multi-trait GWAS m and t 
are much larger than n, the dominating factor is Oitmpn). By contrast, the cost 
for the traditional library approach of LAPACK, which optimizes for a single GLS 
(0(n 3 )) and uses it for each point in the two-dimensional grid, is 0{tmn 3 ); the 
cost for state-of-the-art tools — FaST-LMM and GenABEL — , which optimize for a 
one-dimensional analysis (0(mn 2 )) and use it for each column in Fig. 1, is 0(tmn 2 ). 

Table I collects the mentioned costs together with the ratio with respect to mt- 
GWAS. The message is clear: No matter how optimized a solver for a single GLS 
or for a one-dimensional analysis is, it cannot compete with a solver specifically 
tailored for the entire multi-trait analysis. 

mt-GWAS exploits the specific structure of the operands and the correlation 
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Computational cost 


Ratio over mt-GWAS 


LAPACK 


0(tmn 3 ) 




GWFGLS 


0{tmn 2 ) 


o(f 


FAST-LMM 


0(tmn 2 ) 


o(f 


MT-GWAS 


0(tmnp) 


1 



Table I: Asymptotic cost of each of the discussed approaches to multi-trait GWAS. 
The ratio over MT-GWAS illustrates the impact of exploiting increasing levels of 
correlation within GWAS. MT-GWAS improves the cost of state-of-the-art tools by 
a factor of 0(f). 

among GLSs, and lowers the cost of the best existing methods by a factor of 0(f. 
For a problem of size n — 1,000, p — 4 and large m and t, we can expect mt-GWAS 
to be around two orders of magnitude faster than fast-lmm and GWFGLS. 

4. OUT-OF-CORE: ANALYSIS OF COMPUTATION AND DATA TRANSFERS 

In addition to the formidable computational complexity, GWAS poses a second 
challenge: the management of large datasets. In the prospective scenario in which 
m = 36,000,000, t = 300,000, and n = 10,000, the size of input and output data 
amounts to tens and hundreds of terabytes, respectively. Current analyses already 
involve the processing and generation of few terabytes of data. 

Obviously, present shared-memory architectures are not equipped with such an 
amount of main memory; the size of the datasets thus becomes a limiting factor. 
In order to overcome this limitation, we turn our attention to out-of-core algo- 
rithms [Toledo 1999; Grimes 1988; Agullo et al. 2007]. The goal is to design an 
algorithm that makes an effective use of the available input /output (I/O) mecha- 
nisms, to deal with data sets as large as the available secondary storage, and to 
minimize the overhead due to data transfers. 

In the general case, the operands Xr, y, and B are too large to fit in main 
memory and need to be streamed from disk to memory and vice versa. A naive ap- 
proach, which too often becomes the choice for actual implementations, is sketched 
in Algorithm 6. The algorithm loads one y and one Xr and stores one b at a time, 
resembling an unblocked algorithm. A quick study of the computation and data 
transfers suffices to understand the poor quality of the approach. Let us assume 
the following problem sizes: n = 1,000, p = 4, m = 10 6 , and t = 10 5 . On the one 
hand, this data set requires the performance of approximately 1.1 petaflops; at a 
rate of 25 GFlops/sec (see Sections 5 and 6 for details), it would complete in about 
12 hours. On the other hand, loading t times the whole operand Xr (nxmxtx8, 
assume 8 byte, double precision, data) already causes 800 TBs of traffic between 
disk and main memory. At a peak bandwidth of 2 GB/sec, which is by no means 
reached due to the small size of the transfers, the data movement alone takes 111 
hours, about 10 times more than the time spent in actual computation. In order to 
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attain an efficient out-of-core design, one has to a) study how different parameters 
affect the ratio between computation and data movement, aiming at reducing the 
latter, and b) use techniques for overlapping I/O with computation, thus leading 
to a complete elimination of I/O overhead. 

Algorithm 6: Sketch of a naive out-of-core scheme for mt-GWAS. 

i B:= mt-GWAS ( X L , X R , y, , a), $ ) 
[...] 

3 for 1 < j < t 

4 load yj 
[...] 

e for 1 < i < m 

7 load X Ri 

[...] 

9 store b,;A 



The key behind a drastic reduction of data movement is data reuse; this is com- 
monly attained by blocked algorithms or, in the context of out-of-core, tiled algo- 
rithms. The idea consists in loading not one single y and Xr at a time, but many 
of them, in a chunk or slab] once loaded in memory, each of the elements in the 
slabs can be used repeatedly. Following this approach, the cost of an expensive 
operation such as a disk-memory transfer, is amortized by performing many more 
arithmetic operations per transferred element. 

The ratio of computation over transferred data 

# fl ° pS (3) 

data_to_load + data_to_store 

gives an approximate idea of the potential for the minimization of the impact of the 

data transfers. Since the time for loading an element from disk is much larger than 

performing a scalar operation, large ratios are desired. When applied to a concrete 

situation, the ratio r exposes a number of parameters or degrees of freedom that 

may be adjusted to improve the ratio. In the context of Algorithm 5, these degrees 

of freedom are the number of y's and Xrs loaded at a time. 

The above ratio may be extended to incorporate the concept of overlapping. For 

the algorithm to completely hide the I/O under computation, the time spent in 

computation must be larger than the time for loading and storing data. Hence, the 

inequality 

computation_time > ID_time 
must hold. The inequality may be refined as 

# flops data_to_load + data_to_store . , 

> • ( 4 ) 

# flops/sec ICLbandwidth 

Inequality (4) enables the identification of (ranges of) values for the aforementioned 
degrees of freedom, so that a perfect overlapping is achieved. 

As an overlapping out-of-core mechanism we choose the so-called double-buffering. 
In short, to deploy double-buffering, main memory is divided into two workspaces. 
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Each workspace contains buffers, one per operand to be streamed; in this case, two 
buffers for the input operands Xr and y, and one for the output stream B. At a 
given iteration over the streams, one workspace is used for computation, while the 
other one is used for downloading previous results and uploading data for the next 
iteration. After each iteration, the workspaces swap roles. For more details, we 
refer the reader to [Fabregat-Traver et al. 2012]. 

The rest of this section is dedicated to the analysis of the computation and data 
transfer required by Algorithm 5. During the analysis we expose the degrees of 
freedom, and provide the specific constraints they must satisfy. 

4.1 Preloops: Single sweep over the input streams 

We logically divide Algorithm 5 in two sections: 1) preloops — lines 1 to 4 — , and 
2) loops — lines 5 to 17 — . In the preloops, the operations at lines 1 and 2 involve 
operands that fit in main memory; they are therefore computed through direct 
calls to the corresponding BLAS and LAPACK routines. On the contrary, the two 
matrix products at lines 3 and 4 involve the operands Xr and y, which reside on 
disk, and must therefore be performed in a streaming fashion. 

The approach to compute X' R := Z t Xr (line 4), and equivalently y' := Z T y 
(line 3), consists in a traversal over the stream Xr in slabs containing nb Xr's, 
where the optimal value for nb is to be estimated. At every step over the stream 
(Fig. 2a), the algorithm 

(1) loads nb Xr's, each of them of size n, i.e., n x nb elements; 

(2) stores nb X'^s, each of them also of size n; and 

(3) performs 2 x n 2 x nb flops, corresponding to the matrix product X' R . := Z T XR i . 

The ratio of computation over data movement is 

2 x n 2 x nb 
(n x nb) + (nx nb) ~ 

For typical values of n — from one thousand to tens of thousands — , such a ratio 
shows great potential for perfect overlapping. 

Notice that, even though the ratio is independent of the value of nb, two quantities 
in (4) are influenced by this parameter: the performance of GEMM (# flops/sec), 
and the data transfer rate (lD_bandwidth). The objective for both quantities is 
clear: One wants to maximize them both, to minimize execution time and I/O 
time. The constraint to satisfy is 

2 x n 2 x nb 2 x n x nb x sizeof (datatype) 

; > 7 (5) 

# flops/sec ICLbandwidth 

which simplifies to 

n sizeof (datatype) 

# flops/sec ICLbandwidth 

Since these quantities are specific to the architecture and the problem, we defer 
their evaluation to Sec. 6, where the experimental setup is defined. 
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(a) Single-sweep traversal of the streams 
X R and y in slabs of nb. 



(b) Computation of object B of size mxt 
decomposed into tiles of size (mb x tb). 
Traversal by columns. 



Fig. 2: Meaning of the degrees of freedom in the application of double-buffering to MT-GWAS. 



4.2 Loops: Cross product of the input streams 

The analysis and application of double-buffering to the second section of the algo- 
rithm requires a deeper discussion. Instead of a single sweep through the streams, 
this section operates on every pair (X^, yj) of data, i.e., a "cross product" of 
the input streams Xr and y. The problem can be translated into how the object 
B is built. In view of the impossibility of fitting the entire data set in memory, 
the problem of computing B is decomposed into the computation of smaller parts 
or tiles (Fig. 2b). Let us assign each tile the size mb x tb, where mb and tb are 
the number of Xrs and j/'s processed, respectively. Henceforth we concentrate on 
determining adequate values for these two parameters. 

As for the first section of the algorithm, the goal is to achieve a perfect overlapping 
of I/O with computation. To this end, we must once more study the ratio of 
computation over data movement as a function of the tile size. To compute a tile, 
the algorithm must load (mb x n) + (tb x n) elements, corresponding to the loading 
of a slab of Xrs and a slab of y's, and store (mb xtbxp) elements, corresponding 
to a slab of B. Per tile, 0(mb x tb x n x p) flops are performed, for a ratio of 

0(mb x tb x p x n) 



(mb x n + tb x n) + (mb xtbxp)' 

In Section 3.2 we deduced that a traversal by columns of the grid depicted in 
Fig. 1, is favorable in terms of computational cost and temporary storage. Hence, 
the algorithm will load a slab of and reuse it for all Xrs. Consequently, the 
cost of loading the slab of y } s can be neglected, and the ratio simplifies to 

0(tb x p x n) 

r = ; . 

n + tb x p 

This ratio r is independent of the value mb. Intuitively, if mb is multiplied by 2, 
both the amount of computation within the tile and the required I/O are doubled 
(twice the number of Xr's are loaded and twice the number of fe^ 's are computed 
and stored). On the contrary, the ratio grows monotonically with tb. For the sake 
of clarity, we illustrate in Fig. 3 the behavior of r as a function of tb. For tb = 1, r 
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2000 4000 6000 8000 10000 

tb 

Fig. 3: Ratio of computation over data transfer for tiles of B. The ratio is plotted as a function 
of tb, with p = 4 and n = 1000. The ratio is initially very low but grows rapidly to values that 
allow a perfect overlapping of I/O with computation. 



is 0(p); the ratio rapidly grows with tb until it reaches a value of about 2n, from 
where the growth is much smaller. As discussed previously, large values of the ratio 
are favored. Therefore, tiling along the t dimension (tb > 1) is imperative to reduce 
the I/O overhead. 

However, since we are making use of double-buffering, we do not need to choose 
the largest possible tb; we only require tb to be large enough to completely hide the 
overhead due to I/O. The minimum value of tb leading to a perfect overlap can be 
determined analytically via (4). The time for computing a tile must be larger than 
the time for loading a slab of Xr and storing the corresponding slab of B: 

mb x tb x (5 + 2(p -l))xn (mb x n + mb x tb x p) x sizeof (datatype) 
# flops/sec 10 bandwidth 

Dividing both sides by mb: 

tb x (5 + 2(p — 1)) x n ^ (n + tb x p) x sizeof (datatype) 
# flops/sec ID bandwidth 

In Eq. (7), the values for n and p are given as input; the datatype is known; 
the hard-drive bandwidth is to be determined empirically; and the performance 
attained in the computation of a tile is determined when tailoring for the archi- 
tecture, as discussed in Section 5. As we observed for nb, the parameter mb does 
not influence the reduction of data transfer, and it has to satisfy no explicit con- 
straint. Still, tiling along m (mb > 1) is recommended for performance reasons, as 
we demonstrate in Section 5. 

Let us summarize the study performed in this section. We exposed and analyzed 
the degrees of freedom in the tiling of our algorithm: nb, mb, and tb. The main 
message of the study is that tb is the key in the reduction of the data movement 
to make the computation feasible. Additionally, we discussed the implication of 
the value of all three parameters in terms of performance, and stated the two key 
constraints — Eqs. (6) and (7) — to be satisfied in order to completely eliminate 
overhead due to I/O operations. 
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5. TAILORING FOR SHARED-MEMORY PARALLELISM 

Algorithm 5 (described in Sections 3 and 4) both has a lower computational com- 
plexity than all the current alternatives and eliminates the overhead due to I/O. 
Yet, a careful implementation is required to exploit shared- memory parallelism. In 
this section, we explain how to select the best-suited type of parallelism for each 
section of the algorithm, and outline the steps to achieve almost optimal scalability. 

5.1 Preloop: Multi-threaded BLAS 

The operations in the preloop section, lines 2-5, correspond to either LAPACK 
routines (eigendecomposition) that cast most of the computation in terms of BL AS- 
3 kernels, or are direct calls to BLAS-3 routines (gemm). For this class of routines, 
it is well known that optimized multi-threaded BLAS libraries deliver both high- 
performance and scalability. This is therefore the solution we adopt. 

5.2 Loops: Single-threaded BLAS + OpenMP parallelism 

In sharp contrast to the preloops, the computation performed within the loops, lines 
6-18, maps to non-scalable BLAS-1 and BLAS-2 operations, thus making the use 
of multi-threaded BLAS not viable. Instead, we exploit the multi-core parallelism 
by utilizing a single-threaded BLAS in combination with OpenMP threads and by 
decomposing the computation in tasks. 

In Section 4, it was shown that tiling the computation of B along the t dimension 
is the key to eliminate penalties due to I/O data transfers. In this section we 
demonstrate that for a high-performance shared-memory implementation, it is also 
crucial to tile along the m dimension, and discuss how to select the best tile size and 
shape. Furthermore, we explore different multi-threading strategies to manage the 
data transfers, and to split and assign computational tasks to cores; since the total 
number of options is daunting, here we only describe those two that we considered 
most promising: 

(1) a single master thread is responsible for the streaming of all tiles, and all threads 
collaborate in the computation of each tile; and 

(2) each thread operates on whole tiles, and is responsible for their streaming. 

5.2.1 Two-dimensional tiling. If we only consider tiles of size 1 x tb, the largest 
tile is of size 1 x 100,000 (See Fig. 2b, with mb = 1 and tb = t). In our experimental 
environment with 40 threads, each thread computes a task or block of size 1 x 2,500, 
attaining poor efficiency: 1.54% Instead, by tiling (and blocking) along m as well 
as along t, efficiency increases: 

a) For a tile size of 200 x 100,000 with blocks of size 1 x 100,000, the attained 
efficiency is 3.75%. 

b) For a tile size of 1,000 x 100,000 with blocks of size 100 x 100, the efficiency 
raises to 6.35%. 

A higher workload per thread — case a) — results in a speedup of about 2.5x; by 
further increasing the value of mb — case b) — data locality improves, for speedups 
of about 4x. The benefits of using two-dimensional tiles are thus clear; it remains 
to determine how such tiles should be decomposed to maximize performance. 
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Fig. 4: Study of the performance attained by different block sizes (mbb X tbb). The red line (L3 
cache) delimits the blocks that fit in L3 cache — above the line — . The green region (Efficient 
region) includes block sizes attaining an efficiency of at least 6.7%. Most of the region fits in L3. 
The yellow region (Peak region) includes the most efficient blocks (efficiency above 6.8%). The 
fluctuation within the Peak region is less than 0.5%. 

5.2.2 Estimating the best block size. Typically, as long as the tile size satisfies 
the constraints (6) and (7), one chooses mb and tb to maximize the RAM usage. 
For performance purposes, the computation of such large tiles must be split into 
smaller blocks to exploit both the memory hierarchy and all the available cores. 
The same way the tile size is chosen according to the amount of available main 
memory, in cache-based architectures the block size is determined according to the 
size of the cache memories. The challenge lies on finding the optimal block size 
(mbb x tbb). 

We consider blocking for the highest level of cache. Fig. 4 provides a heatmap 
representing the efficiency attained varying the block sizes. The red line delimits 
the space of block sizes that fit in the last level of cache — level 3 (L3) for the 
architecture used in our experiments (see Section 6 for details) — . As anticipated, 
the block sizes attaining higher efficiency (green line: Efficient region) lie above the 
red line, i.e., fit in L3 cache. Also expected is the fact that relatively large square 
blocks attain the highest efficiency (yellow line: Peak region) . 

Interestingly, we observed that inside the Peak region performance plateaus, ex- 
hibiting variations below 0.5%. This suggests that one should focus on finding this 
region rather than the most efficient block size. As a matter of fact, since within 
the Peak region the fluctuations due to the hardware and the operating system are 
greater than the differences in performance, it can be argued that a best block size 
does not even exist. 

5.2.3 Work distribution. We conclude describing two possibly strategies to dis- 
tribute the blocks among cores. 
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(a) Cooperative threads: A master 
thread loads each tile, and all spawned 
threads cooperate on the computation of 
the tile. 



B 

(b) Independent threads: Each thread is 
responsible for the loading, computation 
and downloading of its own tiles. 



Fig. 5: The two studied approaches to work distribution. 

(1) Cooperative threads (mt-GWAS-Ct). A common scheme for tile-based imple- 
mentations of out-of-core algorithms consists of a master thread that takes 
care of all data transfers, and a number of spawned threads that cooperate in 
the computation of each tile. As Fig. 5a illustrates, each tile is divided into 
blocks, and these are distributed among computing threads, thus sharing the 
workload. 

(2) Independent threads (mt-GWAS-It). Alternatively, work may be distributed so 
that each thread is responsible for the loading, computation, and downloading 
of its own entire tiles. As Fig. 5b illustrates, each tile is still decomposed into 
blocks for performance reasons. 



6. EXPERIMENTAL RESULTS 

We focus now on experimental results. We compare the performance and scalability 
of the state-of-the art tools, fast-lmm and GWFGLS, with those of the presented 
algorithm mt-GWAS. For mt-GWAS we provide results for both parallel approaches 
described in Section 5.2.3: mt-GWAS-CT for cooperative threads, and mt-GWAS-it 
for independent threads. 

6.1 Experimental setup 

All tests were run on a SMP system consisting of 8 Intel Xeon E7-4850 multi- 
core processors. Each processor comprises ten cores, operating at a frequency of 
2.00 GHz, for a combined peak performance of 320 GFlops/sec. The system is 
equipped with 512GB of RAM and 4TBs of disk as secondary memory. The I/O 
system attains a maximum bandwidth of 2GBs/sec for data transfers of at least 
2MBs. 

The routines were compiled with the GNU C Compiler (gec, version 4.4.5), and 
linked to a multi-threaded Intel's MKL library (version 10.3). Our routines make 
use of the OpenMP parallelism provided by the compiler through a number of 
pragma directives. All computations were performed in double precision. 
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6.2 Configuring the degrees of freedom 

In order to attain maximum performance, we need to estimate the most effective 
values for the algorithm parameters: nb, mb, tb, mbb, and tbb. 

6.2.1 nb. As described in Section 4, this parameter is chosen to be greater 
than or equal to the minimum value that maximizes both data transfer rate and 
GEMM's performance. The maximum I/O bandwidth is attained for transfers of 
at least 2MBs; the minimum value of nb to reach this size is 250 (nb x n x 
sizeof (datatype) = 250 x 1000 x 8B = 2MBs). We also determined empiri- 
cally that the minimum value of nb to maximize GEMM's performance for 40 cores 
in the specified architecture is 10,000 (attaining 240 GFlops/sec). Therefore, we 
set nb to 10,000. Substituting in Eq. (6), we see that we will achieve a perfect 
overlapping: 

1000 8 a in a 
> = 4.17 > 4. 

240 x 10 9 2 x 10 9 

6.2.2 mbb and tbb. According to the results shown in Fig. 4, we simply choose 
a square block size within the Peak region: 160 x 160. 

6.2.3 mb and tb. We determined that the maximum performance attained in 
the computation of a tile is about 25 GFlops/sec. Substituting each variable in 
Eq. (7): 

tbx 11 x 1000 8000 + £6x32 , _ 

25 x 10 9 > 2 x 10 9 55 tb>1 °- 

Therefore, and given the chosen block size, the constraint above (tb > 10) does not 
impose any restriction in our choice of tile size (tb will be at least 160) . In any case, 
we emphasize the importance of always carrying out such an analysis, as differences 
in performance or I/O bandwidth could lead to more restrictive constraints. 

As discussed in Section 5, we choose different tile sizes for the different approaches 
to work distribution. In the case of MT-GWAS-CT, we consider large square tiles 
maximizing the usage of free main memory; accordingly we choose mb = tb = 
25,600. For mt-GWAS-it, wc fix the value of tb to that of the block size, i.e., tb = 
tbb = 160, while for mb we select a small multiple of the block size (mb = 1600). The 
reason for such a small tile size is to show that our routine can achieve impressive 
speedups even with a limited amount of main memory, emphasizing that the only 
restriction for mt-GWAS is the size available amount of secondary device storage. 
As an example, for the execution of the largest test reported in this section, which 
involves the processing of more than 3 terabytes of data, mt-GWAS-it only required 
less than 2 gigabytes of memory. 

6.3 Experiments 

We present performance results for fast-lmm, GWFGLS, mt-GWAS-CT, and mt- 
GWAS-it. At first, results for a single core are given, to highlight the speedup due 
to the sole improvement of the algorithm, i.e., the reduction in computational cost. 
Then, we compare the scalability and performance of all four algorithms, using all 
the available 40 cores. 

In Fig. 6, we provide timings for the four routines for increasing values of t. The 
experiments were run using a single thread. The gap between the statc-of-the- 
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Fig. 6: Performance of the single-threaded versions of the four presented routines. The problem 
dimensions are: n = 1,000, p = 4, and m = 100,000. For t = 10,000, our routines — MT-GWAS-CT 
and MT-GWAS-IT — outperform the state-of-the-art tools FAST-LMM and GWFGLS by a factor of 50 
and 43, respectively. 



art routines and our novel algorithm is substantial: while fast-lmm and GWFGLS 
would take, respectively, about 10 and 8.5 days to complete, mt-gwas-ct and mt- 
GWAS-it reduce the execution time to 5.3 and 4.8 hours, respectively. In terms of 
speedups, our best routine, MT-GWAS-it, is 50 and 43 times faster than fast-lmm 
and GWFGLS, respectively. It is also worth noticing the 10% speedup of MT-GWAS-it 
with respect to MT-GWAS-CT. 

Scalability results are shown in Fig. 7. The figure reveals clear scalability is- 
sues in the parallel implementations of both FAST-LMM and GWFGLS; they achieve 
speedups between 1.4 and 1.6, and plateau when 16 or more cores are used. In- 
stead, the results for our routines clearly demonstrate the benefit of carefully tai- 
loring Algorithm 5 for large shared-memory architectures: Using all available 40 
cores, MT-GWAS-CT and mt-GWAS-it attain speedups of 35 and 36.6, respectively. 
Furthermore, the trend presented by both our routines forecasts larger speedups, 
should more cores be available. 

Figure 8 presents performance results for the same four routines when using 40 
threads. Here, the effects of the computational cost reduction, the perfect I/O 
overlapping, and a better scalability are all combined, yielding speedups of 1,352 
and 1,012 over FAST-LMM and GWFGLS, respectively. The time to compute the 
largest presented problem, n = 1,000, p = 4, to = 10 6 , and t = 10 5 , is reduced from 
(unfeasible) years to 12 hours. 

7. FUTURE WORK 

Multi-trait GWAS is constrained by three dimensions: n, ranging from 10 3 to 10 4 , 
to, ranging from 10 6 to 10 s , and t, ranging from 10 4 to 10 5 . The work presented in 
this paper allows m and t to grow as large as desired. On the contrary, our routines 
assume the operand $ g ^™ x ™ to fit in main memory and thus are constrained 
by the size of n. For very large values of n, the problem demands a distributed- 
memory version of mt-GWAS, and therefore requires an extension of the analysis of 
data transfers and work distribution undergone in this paper. 

Additionally there is an increasing demand for support of co-processors such 
as GPUs. While the use of GPUs was proven successful for the single-trait case 
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Fig. 7: Scalability of MT-GWAS-CT, MT-GWAS-IT, FAST-LMM, and GWFGLS. While FAST-LMM and 
GWFGLS present a mediocre scalability, our routines — MT-GWAS-CT and MT-GWAS-IT — attain 
speedups of 35x and 36. 6x, respectively, when using 40 threads. The problem dimensions are: 
n = 1,000, p = 4, m = 100,000, and t = 20,000. 
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Fig. 8: Performance of the multi-threaded versions. The problem dimensions are: n = 1,000, 
p = 4, and m = 1,000,000. While, for t = 100,000, FAST-LMM and GWFGLS are not viable, our 
routines complete in a matter of hours. The observed speedups are larger than lOOOx. 



(t = 1) [Beyer 2012], routines for the more general multi-trait case are not yet 
available. There, the challenge lies in writing optimized kernels for the computation 
within the loops, tuning for the intricate memory hierarchy of the architecture. 

8. CONCLUSIONS 

We addressed an extremely challenging and widespread problem in computational 
biology, the genome-wide association study (GWAS) with multiple traits. GWAS 
involves large-scale computations — petaflops — on large data sets — terabytes of 
data — , and the existing state-of-the-art tools are only effective in conjunction with 
supercomputers. In this paper we introduced mt-GWAS, a novel algorithm for se- 
quences of least-squares problems, tailored to take advantage of both application- 
specific knowledge and shared memory parallelism, and demonstrated that for per- 
forming the full GWAS analysis, a single multi-core node suffices. 
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First, we described the derivation of an algorithm that exploits all knowledge 
available from the application: from the specific two-dimensional sequence of gen- 
eralized least-squares problems, to the special structure of the operands. By elim- 
inating redundant computations, this algorithm lowers the asymptotical cost of 
state-of-the-art tools by several orders of magnitude. 

Then, we discussed how to deal with large-scale datasets. In order to incorpo- 
rate an out-of-core mechanism, we analyzed the ratio between data movement and 
computations, and derived the best tile size and shape for a perfect overlapping of 
data transfers with computation. This mechanism enables the processing of data 
sets as large as the available secondary storage, without any overhead due to I/O 
operations. 

Finally, we tailored our algorithm for shared-memory parallel architectures. The 
study of the different sections of the algorithm suggested the use of a mixed paral- 
lelism: 1) multi-threaded BLAS, and 2) single-threaded BLAS and OpenMP par- 
allelism. We empirically estimated the best size for the computational tasks, and 
studied two different approaches to distribute those task among threads. The re- 
sulting routines attain speedups of 35x and 36. 6x on 40 cores. 

By combining the effects of the computational cost reduction, the perfect I/O 
overlapping, and a high scalability, our routines yield, when compared to the state- 
of-the-art tools, a 1000-fold reduction in the time to solution. Thanks to this 
algorithm, analyses that were thought to be feasible only with the help of super- 
computers, can now be completed in matter of a few hours with a single multi-core 
node. 
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